Rank Join Queries in NoSQL Databases
نویسندگان
چکیده
Rank (i.e., top-k) join queries play a key role in modern analytics tasks. However, despite their importance and unlike centralized settings, they have been completely overlooked in cloud NoSQL settings. We attempt to fill this gap: We contribute a suite of solutions and study their performance comprehensively. Baseline solutions are offered using SQLlike languages (like Hive and Pig), based on MapReduce jobs. We first provide solutions that are based on specialized indices, which may themselves be accessed using either MapReduce or coordinator-based strategies. The first index-based solution is based on inverted indices, which are accessed with MapReduce jobs. The second index-based solution adapts a popular centralized rank-join algorithm. We further contribute a novel statistical structure comprising histograms and Bloom filters, which forms the basis for the third index-based solution. We provide (i) MapReduce algorithms showing how to build these indices and statistical structures, (ii) algorithms to allow for online updates to these indices, and (iii) query processing algorithms utilizing them. We implemented all algorithms in Hadoop (HDFS) and HBase and tested them on TPC-H datasets of various scales, utilizing different queries on tables of various sizes and different score-attribute distributions. We ported our implementations to Amazon EC2 and ”in-house” lab clusters of various scales. We provide performance results for three metrics: query execution time, network bandwidth consumption, and dollar-cost for query execution.
منابع مشابه
A MapReduce Approach to NoSQL RDF Databases
In recent years, the increased need to house and process large volumes of data has prompted the need for distributed storage and querying systems. The growth of machine-readable RDF triples has prompted both industry and academia to develop new database systems, called “NoSQL,” with characteristics that differ from classical databases. Many of these systems compromise ACID properties for increa...
متن کاملJTop Algorithms for Top-k Join Queries
Top-k join queries have become very important in many important areas of computing. One of the most efficient algorithms for top-k join queries is the Rank-Join algorithm [17] [18]. However, there are many cases where Rank-Join does much unnecessary access to the input data sources. In this report, we first show that there are many cases where Rank-Join's stopping mechanism is not efficient, an...
متن کاملConsistent Join Queries in Cloud Data Stores
NoSQL Cloud data stores provide scalability and high availability properties for web applications, but do not support complex queries such as joins. Developers must therefore design their programs according to the peculiarities of NoSQL data stores rather than established software engineering practice. This results in complex and error-prone code, especially when it comes to subtle issues such ...
متن کاملApply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML
As moving to big data world where data is increasing in unstructured way with high velocity, there is a need of data-store to store this bundle amount of data. Traditionally, relational databases are used which are now not compatible to handle this large amount of data, so it is needed to move on to non-relational data-stores. In the current study, we have proposed an extension of the Mongo...
متن کاملApply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML
As moving to big data world where data is increasing in unstructured way with high velocity, there is a need of data-store to store this bundle amount of data. Traditionally, relational databases are used which are now not compatible to handle this large amount of data, so it is needed to move on to non-relational data-stores. In the current study, we have proposed an extension of the Mongo...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- PVLDB
دوره 7 شماره
صفحات -
تاریخ انتشار 2014